import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.cluster import KMeans, AgglomerativeClustering
pd.set_option('display.float_format', lambda x: '%.3f' % x)
#show all columns
pd.set_option('display.max_columns', None)
rank = pd.read_csv('colleges.csv')
rank.head()
| | Unnamed: 0 | College Name | Tuition | Enrollment Numbers |
|---|---|---|---|---|
| 0 | 0 | Princeton University | 56010 | 4773 |
| 1 | 1 | Columbia University | 63530 | 6170 |
| 2 | 2 | Harvard University | 55587 | 5222 |
| 3 | 3 | Massachusetts Institute of Technology | 55878 | 4361 |
| 4 | 4 | Yale University | 59950 | 4703 |
rank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 4 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  Unnamed: 0  392 non-null  int64
 1  College Name  392 non-null  object
 2  Tuition  392 non-null  int64
 3  Enrollment Numbers  392 non-null  int64
dtypes: int64(3), object(1)
memory usage: 12.4+ KB
rank.rename(columns={'Unnamed: 0':'Rank','College Name':'Name'}, inplace=True)
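The `Unnamed: 0` column renamed above is what pandas produces when a CSV's first header field is empty, which usually means the file was saved with its index. A minimal sketch of that behavior follows; the inline CSV text is a stand-in for `colleges.csv` (only the two rows shown in the preview are reproduced), and `toy_rank` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Stand-in for colleges.csv: note the empty first header field.
csv_text = (
    ",College Name,Tuition,Enrollment Numbers\n"
    "0,Princeton University,56010,4773\n"
    "1,Columbia University,63530,6170\n"
)

toy_rank = pd.read_csv(io.StringIO(csv_text))

# pandas labels the empty header field 'Unnamed: 0'; rename it as above.
toy_rank = toy_rank.rename(columns={"Unnamed: 0": "Rank", "College Name": "Name"})
print(toy_rank.columns.tolist())
```

The saved index starts at 0, so `Rank` here is a zero-based position in the ranking file.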
college_data = pd.read_csv('Data-Table 1.csv')
college_data.head()
| | Name | Applicants total | Admissions total | Enrolled total | Percent of freshmen submitting SAT scores | Percent of freshmen submitting ACT scores | SAT Critical Reading 25th percentile score | SAT Critical Reading 75th percentile score | SAT Math 25th percentile score | SAT Math 75th percentile score | SAT Writing 25th percentile score | SAT Writing 75th percentile score | ACT Composite 25th percentile score | ACT Composite 75th percentile score | State abbreviation | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Total enrollment | Full time enrollment | Part time enrollment | Undergraduate enrollment | Graduate enrollment | Full time undergraduate enrollment | Part time undergraduate enrollment | Percent of total enrollment that are Asian | Percent of total enrollment that are Black or African American | Percent of total enrollment that are Hispanic/Latino | Percent of total enrollment that are Native Hawaiian or Other Pacific Islander | Percent of total enrollment that are White | Percent of total enrollment that are two or more races | Percent of total enrollment that are Nonresident Alien | Percent of total enrollment that are women | Percent of undergraduate enrollment that are American Indian or Alaska Native | Number of first time undergraduates in state | Number of first time undergraduates out of state | Number of first time undergraduates foreign countries | Number of first time undergraduates residence unknown | Graduation rate Bachelor degree within 4 years, total | Graduation rate Bachelor degree within 5 years, total | Graduation rate Bachelor degree within 6 years, total | Percent of freshmen receiving any financial aid | Percent of freshmen receiving federal grant aid | Percent of freshmen receiving Pell grants | Percent of freshmen receiving institutional grant aid | Percent of freshmen receiving student loan aid | Endowment assets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama A & M University | 6142.000 | 5521.000 | 1104.000 | 15.000 | 88.000 | 370.000 | 450.000 | 350.000 | 450.000 | NaN | NaN | 15.000 | 19.000 | Alabama | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | Yes | City: Midsize | Master's Colleges and Universities (larger pro... | 5020.000 | 4439.000 | 581.000 | 4051.000 | 969.000 | 3799.000 | 252.000 | 1.000 | 92.000 | 1.000 | 0.000 | 5.000 | 0.000 | 0.000 | 55.000 | 0.000 | NaN | NaN | NaN | NaN | 10.000 | 23.000 | 29.000 | 97.000 | 81.000 | 81.000 | 32.000 | 89.000 | 0 |
| 1 | University of Alabama at Birmingham | 5689.000 | 4934.000 | 1773.000 | 6.000 | 93.000 | 520.000 | 640.000 | 520.000 | 650.000 | NaN | NaN | 22.000 | 28.000 | Alabama | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | No | City: Midsize | Research Universities (very high research acti... | 18568.000 | 11961.000 | 6607.000 | 11502.000 | 7066.000 | 8357.000 | 3145.000 | 5.000 | 21.000 | 3.000 | 0.000 | 64.000 | 3.000 | 3.000 | 61.000 | 0.000 | 1529.000 | 224.000 | 19.000 | 1.000 | 29.000 | 46.000 | 53.000 | 90.000 | 36.000 | 36.000 | 60.000 | 56.000 | 24136 |
| 2 | Amridge University | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Alabama | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | City: Midsize | Baccalaureate Colleges Arts & Sciences | 631.000 | 323.000 | 308.000 | 322.000 | 309.000 | 202.000 | 120.000 | 0.000 | 40.000 | 1.000 | 0.000 | 30.000 | 0.000 | 0.000 | 58.000 | 0.000 | NaN | NaN | NaN | NaN | 0.000 | 0.000 | 67.000 | 100.000 | 90.000 | 90.000 | 90.000 | 100.000 | 302 |
| 3 | University of Alabama at Huntsville | 2054.000 | 1656.000 | 651.000 | 34.000 | 94.000 | 510.000 | 640.000 | 510.000 | 650.000 | NaN | NaN | 23.000 | 29.000 | Alabama | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | No | City: Midsize | Research Universities (very high research acti... | 7376.000 | 4802.000 | 2574.000 | 5696.000 | 1680.000 | 4237.000 | 1459.000 | 4.000 | 12.000 | 3.000 | 0.000 | 69.000 | 1.000 | 6.000 | 44.000 | 1.000 | 514.000 | 92.000 | 27.000 | 18.000 | 16.000 | 37.000 | 48.000 | 87.000 | 31.000 | 31.000 | 63.000 | 46.000 | 11502 |
| 4 | Alabama State University | 10245.000 | 5251.000 | 1479.000 | 18.000 | 87.000 | 380.000 | 480.000 | 370.000 | 480.000 | NaN | NaN | 15.000 | 19.000 | Alabama | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | Yes | City: Midsize | Master's Colleges and Universities (larger pro... | 6075.000 | 5182.000 | 893.000 | 5356.000 | 719.000 | 4872.000 | 484.000 | 0.000 | 91.000 | 1.000 | 0.000 | 3.000 | 1.000 | 2.000 | 61.000 | 0.000 | 903.000 | 571.000 | 67.000 | 4.000 | 9.000 | 19.000 | 25.000 | 93.000 | 76.000 | 76.000 | 34.000 | 81.000 | 13202 |
#nulls for each column
college_data.isnull().sum()
Name  0
Applicants total  157
Admissions total  157
Enrolled total  157
Percent of freshmen submitting SAT scores  277
Percent of freshmen submitting ACT scores  275
SAT Critical Reading 25th percentile score  365
SAT Critical Reading 75th percentile score  365
SAT Math 25th percentile score  352
SAT Math 75th percentile score  352
SAT Writing 25th percentile score  820
SAT Writing 75th percentile score  820
ACT Composite 25th percentile score  335
ACT Composite 75th percentile score  335
State abbreviation  0
Geographic region  0
Control of institution  0
Historically Black College or University  0
Degree of urbanization (Urban centric locale)  0
Carnegie Classification 2010: Basic  0
Total enrollment  2
Full time enrollment  2
Part time enrollment  2
Undergraduate enrollment  2
Graduate enrollment  2
Full time undergraduate enrollment  2
Part time undergraduate enrollment  2
Percent of total enrollment that are Asian  2
Percent of total enrollment that are Black or African American  2
Percent of total enrollment that are Hispanic/Latino  2
Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  2
Percent of total enrollment that are White  2
Percent of total enrollment that are two or more races  2
Percent of total enrollment that are Nonresident Alien  2
Percent of total enrollment that are women  2
Percent of undergraduate enrollment that are American Indian or Alaska Native  12
Number of first time undergraduates in state  623
Number of first time undergraduates out of state  623
Number of first time undergraduates foreign countries  623
Number of first time undergraduates residence unknown  623
Graduation rate Bachelor degree within 4 years, total  58
Graduation rate Bachelor degree within 5 years, total  58
Graduation rate Bachelor degree within 6 years, total  58
Percent of freshmen receiving any financial aid  42
Percent of freshmen receiving federal grant aid  42
Percent of freshmen receiving Pell grants  42
Percent of freshmen receiving institutional grant aid  42
Percent of freshmen receiving student loan aid  42
Endowment assets  0
dtype: int64
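Raw null counts are easier to compare as a share of rows. The sketch below shows one way to rank columns by missing share and flag sparse ones. The `toy` frame and the 0.5 cutoff are assumptions made for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for college_data (values are made up).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Applicants total": [100.0, np.nan, 250.0, 80.0],
    "SAT Writing 25th percentile score": [np.nan, np.nan, np.nan, 500.0],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns missing in more than half of the rows (threshold is arbitrary).
sparse_cols = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(sparse_cols)
```

The same two lines could be applied to `college_data`, for example to decide whether the SAT Writing columns are worth keeping.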
college_data.columns
Index(['Name', 'Applicants total', 'Admissions total', 'Enrolled total',
'Percent of freshmen submitting SAT scores',
'Percent of freshmen submitting ACT scores',
'SAT Critical Reading 25th percentile score',
'SAT Critical Reading 75th percentile score',
'SAT Math 25th percentile score', 'SAT Math 75th percentile score',
'SAT Writing 25th percentile score',
'SAT Writing 75th percentile score',
'ACT Composite 25th percentile score',
'ACT Composite 75th percentile score', 'State abbreviation',
'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic', 'Total enrollment',
'Full time enrollment', 'Part time enrollment',
'Undergraduate enrollment', 'Graduate enrollment',
'Full time undergraduate enrollment',
'Part time undergraduate enrollment',
'Percent of total enrollment that are Asian',
'Percent of total enrollment that are Black or African American',
'Percent of total enrollment that are Hispanic/Latino',
'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
'Percent of total enrollment that are White',
'Percent of total enrollment that are two or more races',
'Percent of total enrollment that are Nonresident Alien',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are American Indian or Alaska Native',
'Number of first time undergraduates in state',
'Number of first time undergraduates out of state',
'Number of first time undergraduates foreign countries',
'Number of first time undergraduates residence unknown',
'Graduation rate Bachelor degree within 4 years, total',
'Graduation rate Bachelor degree within 5 years, total',
'Graduation rate Bachelor degree within 6 years, total',
'Percent of freshmen receiving any financial aid',
'Percent of freshmen receiving federal grant aid',
'Percent of freshmen receiving Pell grants',
'Percent of freshmen receiving institutional grant aid',
'Percent of freshmen receiving student loan aid', 'Endowment assets'],
dtype='object')
# strip punctuation; regex=True is explicit because the pattern is a character class (it also matches literal commas)
college_data['Name'] = college_data['Name'].str.replace(r'[#,@&+*%$^!~.]', '', regex=True)
college_data['Name'] = college_data['Name'].str.replace('The ', '', regex=False)
college_data['Name'] = college_data['Name'].str.replace(' at ', ' ', regex=False)
college_data['Name'] = college_data['Name'].str.replace('Main Campus', '', regex=False)
college_data['Name'] = college_data['Name'].str.replace(' and ', ' ', regex=False)
# collapse repeated whitespace left behind by the removals above
college_data['Name'] = college_data['Name'].str.replace(r'\s+', ' ', regex=True)
college_data['Name'] = college_data['Name'].str.strip()
# same cleaning for the ranking names, plus hyphens replaced by spaces
rank['Name'] = rank['Name'].str.replace(r'[#,@&+*%$^!~.]', '', regex=True)
rank['Name'] = rank['Name'].str.replace('The ', '', regex=False)
rank['Name'] = rank['Name'].str.replace(' at ', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace(' and ', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace('--', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace('-', ' ', regex=False)
# collapse repeated whitespace left behind by the removals above
rank['Name'] = rank['Name'].str.replace(r'\s+', ' ', regex=True)
rank['Name'] = rank['Name'].str.strip()
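The two cleaning passes above differ slightly: only `college_data` drops 'Main Campus' and only `rank` replaces hyphens. Since the merge key has to agree on both sides, one option is a single helper applied to both frames. The sketch below is an assumption about how that could look, not the notebook's method. `normalize_names` and the `demo` names are hypothetical, and the helper applies the union of both passes.

```python
import pandas as pd


def normalize_names(names: pd.Series) -> pd.Series:
    """Apply one shared set of cleaning steps to a Series of college names."""
    return (
        names
        .str.replace(r"[#,@&+*%$^!~.]", "", regex=True)  # strip punctuation
        .str.replace("The ", "", regex=False)
        .str.replace(" at ", " ", regex=False)
        .str.replace("Main Campus", "", regex=False)
        .str.replace(" and ", " ", regex=False)
        .str.replace("-", " ", regex=False)              # also handles '--'
        .str.replace(r"\s+", " ", regex=True)            # collapse whitespace
        .str.strip()
    )


# Made-up spellings used only to exercise the helper.
demo = pd.Series([
    "The University of Alabama at Birmingham",
    "Texas A&M University--College Station",
    "Purdue University-Main Campus",
])
cleaned = normalize_names(demo).tolist()
print(cleaned)
```

Usage would be `rank['Name'] = normalize_names(rank['Name'])` and the same call on `college_data['Name']`, so that both keys go through identical steps.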
#merge the two dataframes on the college name
data = pd.merge(rank, college_data, on='Name', how='outer')
data.head()
| | Rank | Name | Tuition | Enrollment Numbers | Applicants total | Admissions total | Enrolled total | Percent of freshmen submitting SAT scores | Percent of freshmen submitting ACT scores | SAT Critical Reading 25th percentile score | SAT Critical Reading 75th percentile score | SAT Math 25th percentile score | SAT Math 75th percentile score | SAT Writing 25th percentile score | SAT Writing 75th percentile score | ACT Composite 25th percentile score | ACT Composite 75th percentile score | State abbreviation | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Total enrollment | Full time enrollment | Part time enrollment | Undergraduate enrollment | Graduate enrollment | Full time undergraduate enrollment | Part time undergraduate enrollment | Percent of total enrollment that are Asian | Percent of total enrollment that are Black or African American | Percent of total enrollment that are Hispanic/Latino | Percent of total enrollment that are Native Hawaiian or Other Pacific Islander | Percent of total enrollment that are White | Percent of total enrollment that are two or more races | Percent of total enrollment that are Nonresident Alien | Percent of total enrollment that are women | Percent of undergraduate enrollment that are American Indian or Alaska Native | Number of first time undergraduates in state | Number of first time undergraduates out of state | Number of first time undergraduates foreign countries | Number of first time undergraduates residence unknown | Graduation rate Bachelor degree within 4 years, total | Graduation rate Bachelor degree within 5 years, total | Graduation rate Bachelor degree within 6 years, total | Percent of freshmen receiving any financial aid | Percent of freshmen receiving federal grant aid | Percent of freshmen receiving Pell grants | Percent of freshmen receiving institutional grant aid | Percent of freshmen receiving student loan aid | Endowment assets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | Princeton University | 56010.000 | 4773.000 | 26499.000 | 1963.000 | 1285.000 | 86.000 | 33.000 | 700.000 | 800.000 | 710.000 | 800.000 | 710.000 | 790.000 | 31.000 | 35.000 | New Jersey | Mid East DE DC MD NJ NY PA | Private not for profit | No | Suburb: Large | Research Universities (very high research acti... | 8014.000 | 7935.000 | 79.000 | 5323.000 | 2691.000 | 5244.000 | 79.000 | 15.000 | 6.000 | 7.000 | 0.000 | 45.000 | 4.000 | 20.000 | 45.000 | 0.000 | 197.000 | 929.000 | 157.000 | 1.000 | 88.000 | 95.000 | 97.000 | 60.000 | 14.000 | 14.000 | 60.000 | 9.000 | 2320421.000 |
| 1 | 1.000 | Columbia University | 63530.000 | 6170.000 | 31851.000 | 2362.000 | 1415.000 | 90.000 | 32.000 | 690.000 | 780.000 | 700.000 | 790.000 | 690.000 | 780.000 | 31.000 | 34.000 | New York | Mid East DE DC MD NJ NY PA | Private not for profit | No | City: Large | Research Universities (very high research acti... | 26957.000 | 22731.000 | 4226.000 | 7970.000 | 18987.000 | 7374.000 | 596.000 | 13.000 | 5.000 | 8.000 | 0.000 | 36.000 | 3.000 | 28.000 | 51.000 | 1.000 | 324.000 | 961.000 | 224.000 | 0.000 | 86.000 | 92.000 | 93.000 | 57.000 | 15.000 | 15.000 | 49.000 | 16.000 | 316753.000 |
| 2 | 2.000 | Harvard University | 55587.000 | 5222.000 | 35023.000 | 2047.000 | 1659.000 | 86.000 | 38.000 | 700.000 | 800.000 | 710.000 | 800.000 | 710.000 | 800.000 | 32.000 | 35.000 | Massachusetts | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 28297.000 | 20370.000 | 7927.000 | 10534.000 | 17763.000 | 7240.000 | 3294.000 | 13.000 | 5.000 | 7.000 | 0.000 | 45.000 | 3.000 | 21.000 | 49.000 | 0.000 | NaN | NaN | NaN | NaN | 87.000 | 95.000 | 97.000 | 75.000 | 15.000 | 15.000 | 58.000 | 9.000 | 1392761.000 |
| 3 | 3.000 | Massachusetts Institute of Technology | 55878.000 | 4361.000 | 18989.000 | 1548.000 | 1115.000 | 85.000 | 40.000 | 680.000 | 770.000 | 750.000 | 800.000 | 690.000 | 780.000 | 33.000 | 35.000 | Massachusetts | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 11301.000 | 11138.000 | 163.000 | 4528.000 | 6773.000 | 4499.000 | 29.000 | 16.000 | 3.000 | 9.000 | 0.000 | 34.000 | 3.000 | 29.000 | 37.000 | 0.000 | 75.000 | 929.000 | 110.000 | 1.000 | 84.000 | 91.000 | 93.000 | 87.000 | 18.000 | 16.000 | 55.000 | 19.000 | 980404.000 |
| 4 | 4.000 | Yale University | 59950.000 | 4703.000 | 28977.000 | 2043.000 | 1356.000 | 84.000 | 35.000 | 700.000 | 800.000 | 710.000 | 790.000 | 710.000 | 800.000 | 32.000 | 35.000 | Connecticut | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 12109.000 | 11927.000 | 182.000 | 5430.000 | 6679.000 | 5424.000 | 6.000 | 13.000 | 5.000 | 7.000 | 0.000 | 48.000 | 4.000 | 18.000 | 49.000 | 1.000 | 82.000 | 1113.000 | 162.000 | 1.000 | 90.000 | 96.000 | 98.000 | 61.000 | 13.000 | 13.000 | 50.000 | 6.000 | 1528324.000 |
just_rank = data[(data['Rank'].notnull())&(data['Applicants total'].isnull())]
just_rank.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27 entries, 52 to 392
Data columns (total 52 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  Rank  27 non-null  float64
 1  Name  27 non-null  object
 2  Tuition  27 non-null  float64
 3  Enrollment Numbers  27 non-null  float64
 4  Applicants total  0 non-null  float64
 5  Admissions total  0 non-null  float64
 6  Enrolled total  0 non-null  float64
 7  Percent of freshmen submitting SAT scores  0 non-null  float64
 8  Percent of freshmen submitting ACT scores  0 non-null  float64
 9  SAT Critical Reading 25th percentile score  0 non-null  float64
 10  SAT Critical Reading 75th percentile score  0 non-null  float64
 11  SAT Math 25th percentile score  0 non-null  float64
 12  SAT Math 75th percentile score  0 non-null  float64
 13  SAT Writing 25th percentile score  0 non-null  float64
 14  SAT Writing 75th percentile score  0 non-null  float64
 15  ACT Composite 25th percentile score  0 non-null  float64
 16  ACT Composite 75th percentile score  0 non-null  float64
 17  State abbreviation  8 non-null  object
 18  Geographic region  8 non-null  object
 19  Control of institution  8 non-null  object
 20  Historically Black College or University  8 non-null  object
 21  Degree of urbanization (Urban centric locale)  8 non-null  object
 22  Carnegie Classification 2010: Basic  8 non-null  object
 23  Total enrollment  8 non-null  float64
 24  Full time enrollment  8 non-null  float64
 25  Part time enrollment  8 non-null  float64
 26  Undergraduate enrollment  8 non-null  float64
 27  Graduate enrollment  8 non-null  float64
 28  Full time undergraduate enrollment  8 non-null  float64
 29  Part time undergraduate enrollment  8 non-null  float64
 30  Percent of total enrollment that are Asian  8 non-null  float64
 31  Percent of total enrollment that are Black or African American  8 non-null  float64
 32  Percent of total enrollment that are Hispanic/Latino  8 non-null  float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  8 non-null  float64
 34  Percent of total enrollment that are White  8 non-null  float64
 35  Percent of total enrollment that are two or more races  8 non-null  float64
 36  Percent of total enrollment that are Nonresident Alien  8 non-null  float64
 37  Percent of total enrollment that are women  8 non-null  float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native  8 non-null  float64
 39  Number of first time undergraduates in state  4 non-null  float64
 40  Number of first time undergraduates out of state  4 non-null  float64
 41  Number of first time undergraduates foreign countries  4 non-null  float64
 42  Number of first time undergraduates residence unknown  4 non-null  float64
 43  Graduation rate Bachelor degree within 4 years, total  7 non-null  float64
 44  Graduation rate Bachelor degree within 5 years, total  7 non-null  float64
 45  Graduation rate Bachelor degree within 6 years, total  7 non-null  float64
 46  Percent of freshmen receiving any financial aid  7 non-null  float64
 47  Percent of freshmen receiving federal grant aid  7 non-null  float64
 48  Percent of freshmen receiving Pell grants  7 non-null  float64
 49  Percent of freshmen receiving institutional grant aid  7 non-null  float64
 50  Percent of freshmen receiving student loan aid  7 non-null  float64
 51  Endowment assets  8 non-null  float64
dtypes: float64(45), object(7)
memory usage: 11.2+ KB
just_rank = just_rank[just_rank['Endowment assets'].isnull()]
just_rank['Name'].unique()
array(['Purdue University West Lafayette',
'Pennsylvania State University University Park',
'University of California Merced', 'Thomas Jefferson University',
'Russell Sage College',
'Inter American University of Puerto Rico San German',
'Tennessee Techn University', 'Long Island University',
'University of Puerto Rico Rio Piedras', 'Augusta University',
'Colorado Technical University', 'Grand Canyon University',
'Inter American University of Puerto Rico Metropolitan Campus',
'Keiser University', 'Mary Baldwin University',
'Pontifical Catholic University of Puerto Rico Ponce',
'Universidad Ana G Mendez Gurabo Campus', 'University of Phoenix',
'University of Texas Rio Grande Valley'], dtype=object)
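Names like these often fail to match only because the two sources spell them differently. Before dropping them, one could look for near matches with the standard library's `difflib.get_close_matches`. The sketch below is only an illustration: the `candidates` spellings are made up and do not come from `Data-Table 1.csv`, and the 0.8 cutoff is an arbitrary assumption.

```python
from difflib import get_close_matches

# Two of the unmatched ranking names listed above.
unmatched = ["University of California Merced", "Keiser University"]

# Hypothetical spellings standing in for names from the other dataset.
candidates = ["University of California-Merced", "Yale University"]

# For each unmatched name, keep at most one candidate above the cutoff.
suggestions = {
    name: get_close_matches(name, candidates, n=1, cutoff=0.8)
    for name in unmatched
}
print(suggestions)
```

An empty list means no candidate was similar enough, so that name would still need a manual check or be dropped as in the next cell.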
data.drop(just_rank.index, inplace=True)
data[data['Name']=='University of Texas Rio Grande Valley']
| | Rank | Name | Tuition | Enrollment Numbers | Applicants total | Admissions total | Enrolled total | Percent of freshmen submitting SAT scores | Percent of freshmen submitting ACT scores | SAT Critical Reading 25th percentile score | SAT Critical Reading 75th percentile score | SAT Math 25th percentile score | SAT Math 75th percentile score | SAT Writing 25th percentile score | SAT Writing 75th percentile score | ACT Composite 25th percentile score | ACT Composite 75th percentile score | State abbreviation | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Total enrollment | Full time enrollment | Part time enrollment | Undergraduate enrollment | Graduate enrollment | Full time undergraduate enrollment | Part time undergraduate enrollment | Percent of total enrollment that are Asian | Percent of total enrollment that are Black or African American | Percent of total enrollment that are Hispanic/Latino | Percent of total enrollment that are Native Hawaiian or Other Pacific Islander | Percent of total enrollment that are White | Percent of total enrollment that are two or more races | Percent of total enrollment that are Nonresident Alien | Percent of total enrollment that are women | Percent of undergraduate enrollment that are American Indian or Alaska Native | Number of first time undergraduates in state | Number of first time undergraduates out of state | Number of first time undergraduates foreign countries | Number of first time undergraduates residence unknown | Graduation rate Bachelor degree within 4 years, total | Graduation rate Bachelor degree within 5 years, total | Graduation rate Bachelor degree within 6 years, total | Percent of freshmen receiving any financial aid | Percent of freshmen receiving federal grant aid | Percent of freshmen receiving Pell grants | Percent of freshmen receiving institutional grant aid | Percent of freshmen receiving student loan aid | Endowment assets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
just_data = data[(data['Rank'].isnull())&(data['Applicants total'].notnull())]
just_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1010 entries, 394 to 1551
Data columns (total 52 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  Rank  0 non-null  float64
 1  Name  1010 non-null  object
 2  Tuition  0 non-null  float64
 3  Enrollment Numbers  0 non-null  float64
 4  Applicants total  1010 non-null  float64
 5  Admissions total  1010 non-null  float64
 6  Enrolled total  1010 non-null  float64
 7  Percent of freshmen submitting SAT scores  903 non-null  float64
 8  Percent of freshmen submitting ACT scores  904 non-null  float64
 9  SAT Critical Reading 25th percentile score  832 non-null  float64
 10  SAT Critical Reading 75th percentile score  832 non-null  float64
 11  SAT Math 25th percentile score  840 non-null  float64
 12  SAT Math 75th percentile score  840 non-null  float64
 13  SAT Writing 25th percentile score  507 non-null  float64
 14  SAT Writing 75th percentile score  507 non-null  float64
 15  ACT Composite 25th percentile score  858 non-null  float64
 16  ACT Composite 75th percentile score  858 non-null  float64
 17  State abbreviation  1010 non-null  object
 18  Geographic region  1010 non-null  object
 19  Control of institution  1010 non-null  object
 20  Historically Black College or University  1010 non-null  object
 21  Degree of urbanization (Urban centric locale)  1010 non-null  object
 22  Carnegie Classification 2010: Basic  1010 non-null  object
 23  Total enrollment  1010 non-null  float64
 24  Full time enrollment  1010 non-null  float64
 25  Part time enrollment  1010 non-null  float64
 26  Undergraduate enrollment  1010 non-null  float64
 27  Graduate enrollment  1010 non-null  float64
 28  Full time undergraduate enrollment  1010 non-null  float64
 29  Part time undergraduate enrollment  1010 non-null  float64
 30  Percent of total enrollment that are Asian  1010 non-null  float64
 31  Percent of total enrollment that are Black or African American  1010 non-null  float64
 32  Percent of total enrollment that are Hispanic/Latino  1010 non-null  float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  1010 non-null  float64
 34  Percent of total enrollment that are White  1010 non-null  float64
 35  Percent of total enrollment that are two or more races  1010 non-null  float64
 36  Percent of total enrollment that are Nonresident Alien  1010 non-null  float64
 37  Percent of total enrollment that are women  1010 non-null  float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native  1010 non-null  float64
 39  Number of first time undergraduates in state  591 non-null  float64
 40  Number of first time undergraduates out of state  591 non-null  float64
 41  Number of first time undergraduates foreign countries  591 non-null  float64
 42  Number of first time undergraduates residence unknown  591 non-null  float64
 43  Graduation rate Bachelor degree within 4 years, total  1002 non-null  float64
 44  Graduation rate Bachelor degree within 5 years, total  1002 non-null  float64
 45  Graduation rate Bachelor degree within 6 years, total  1002 non-null  float64
 46  Percent of freshmen receiving any financial aid  1007 non-null  float64
 47  Percent of freshmen receiving federal grant aid  1007 non-null  float64
 48  Percent of freshmen receiving Pell grants  1007 non-null  float64
 49  Percent of freshmen receiving institutional grant aid  1007 non-null  float64
 50  Percent of freshmen receiving student loan aid  1007 non-null  float64
 51  Endowment assets  1010 non-null  float64
dtypes: float64(45), object(7)
memory usage: 418.2+ KB
complete = data[(data['Rank'].notnull())&(data['Applicants total'].notnull())]
complete.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 367 entries, 0 to 393
Data columns (total 52 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  Rank  367 non-null  float64
 1  Name  367 non-null  object
 2  Tuition  367 non-null  float64
 3  Enrollment Numbers  367 non-null  float64
 4  Applicants total  367 non-null  float64
 5  Admissions total  367 non-null  float64
 6  Enrolled total  367 non-null  float64
 7  Percent of freshmen submitting SAT scores  354 non-null  float64
 8  Percent of freshmen submitting ACT scores  355 non-null  float64
 9  SAT Critical Reading 25th percentile score  337 non-null  float64
 10  SAT Critical Reading 75th percentile score  337 non-null  float64
 11  SAT Math 25th percentile score  342 non-null  float64
 12  SAT Math 75th percentile score  342 non-null  float64
 13  SAT Writing 25th percentile score  207 non-null  float64
 14  SAT Writing 75th percentile score  207 non-null  float64
 15  ACT Composite 25th percentile score  341 non-null  float64
 16  ACT Composite 75th percentile score  341 non-null  float64
 17  State abbreviation  367 non-null  object
 18  Geographic region  367 non-null  object
 19  Control of institution  367 non-null  object
 20  Historically Black College or University  367 non-null  object
 21  Degree of urbanization (Urban centric locale)  367 non-null  object
 22  Carnegie Classification 2010: Basic  367 non-null  object
 23  Total enrollment  367 non-null  float64
 24  Full time enrollment  367 non-null  float64
 25  Part time enrollment  367 non-null  float64
 26  Undergraduate enrollment  367 non-null  float64
 27  Graduate enrollment  367 non-null  float64
 28  Full time undergraduate enrollment  367 non-null  float64
 29  Part time undergraduate enrollment  367 non-null  float64
 30  Percent of total enrollment that are Asian  367 non-null  float64
 31  Percent of total enrollment that are Black or African American  367 non-null  float64
 32  Percent of total enrollment that are Hispanic/Latino  367 non-null  float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  367 non-null  float64
 34  Percent of total enrollment that are White  367 non-null  float64
 35  Percent of total enrollment that are two or more races  367 non-null  float64
 36  Percent of total enrollment that are Nonresident Alien  367 non-null  float64
 37  Percent of total enrollment that are women  367 non-null  float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native  367 non-null  float64
 39  Number of first time undergraduates in state  268 non-null  float64
 40  Number of first time undergraduates out of state  268 non-null  float64
 41  Number of first time undergraduates foreign countries  268 non-null  float64
 42  Number of first time undergraduates residence unknown  268 non-null  float64
 43  Graduation rate Bachelor degree within 4 years, total  365 non-null  float64
 44  Graduation rate Bachelor degree within 5 years, total  365 non-null  float64
 45  Graduation rate Bachelor degree within 6 years, total  365 non-null  float64
 46  Percent of freshmen receiving any financial aid  366 non-null  float64
 47  Percent of freshmen receiving federal grant aid  366 non-null  float64
 48  Percent of freshmen receiving Pell grants  366 non-null  float64
 49  Percent of freshmen receiving institutional grant aid  366 non-null  float64
 50  Percent of freshmen receiving student loan aid  366 non-null  float64
 51  Endowment assets  367 non-null  float64
dtypes: float64(45), object(7)
memory usage: 152.0+ KB
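The rank-only, data-only and complete subsets above are derived from null checks on `Rank` and `Applicants total`. An alternative that reports key matches directly is the `indicator=True` option of `pd.merge`. The sketch below uses two toy frames (`rank_toy` and `stats_toy`, both made up) rather than the real data. Note that it answers a slightly different question: a row can match on `Name` and still lack applicant figures, so the counts need not equal the ones above.

```python
import pandas as pd

# Toy stand-ins for rank and college_data (values are made up).
rank_toy = pd.DataFrame({
    "Name": ["Alpha University", "Beta College"],
    "Rank": [0, 1],
})
stats_toy = pd.DataFrame({
    "Name": ["Beta College", "Gamma Institute"],
    "Applicants total": [1200.0, 800.0],
})

# indicator=True adds a '_merge' column: left_only, right_only or both.
merged = pd.merge(rank_toy, stats_toy, on="Name", how="outer", indicator=True)

rank_only = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
data_only = merged.loc[merged["_merge"] == "right_only", "Name"].tolist()
n_both = int((merged["_merge"] == "both").sum())
print(rank_only, data_only, n_both)
```

Filtering on `_merge == "left_only"` gives the ranked schools whose cleaned name found no partner at all, which is the group most likely to need further name cleaning.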
just_rank.to_csv('just_rank.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1534 entries, 0 to 1552
Data columns (total 52 columns):
 #  Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0  Rank  375 non-null  float64
 1  Name  1534 non-null  object
 2  Tuition  375 non-null  float64
 3  Enrollment Numbers  375 non-null  float64
 4  Applicants total  1377 non-null  float64
 5  Admissions total  1377 non-null  float64
 6  Enrolled total  1377 non-null  float64
 7  Percent of freshmen submitting SAT scores  1257 non-null  float64
 8  Percent of freshmen submitting ACT scores  1259 non-null  float64
 9  SAT Critical Reading 25th percentile score  1169 non-null  float64
 10  SAT Critical Reading 75th percentile score  1169 non-null  float64
 11  SAT Math 25th percentile score  1182 non-null  float64
 12  SAT Math 75th percentile score  1182 non-null  float64
 13  SAT Writing 25th percentile score  714 non-null  float64
 14  SAT Writing 75th percentile score  714 non-null  float64
 15  ACT Composite 25th percentile score  1199 non-null  float64
 16  ACT Composite 75th percentile score  1199 non-null  float64
 17  State abbreviation  1534 non-null  object
 18  Geographic region  1534 non-null  object
 19  Control of institution  1534 non-null  object
 20  Historically Black College or University  1534 non-null  object
 21  Degree of urbanization (Urban centric locale)  1534 non-null  object
 22  Carnegie Classification 2010: Basic  1534 non-null  object
 23  Total enrollment  1532 non-null  float64
 24  Full time enrollment  1532 non-null  float64
 25  Part time enrollment  1532 non-null  float64
 26  Undergraduate enrollment  1532 non-null  float64
 27  Graduate enrollment  1532 non-null  float64
 28  Full time undergraduate enrollment  1532 non-null  float64
 29  Part time undergraduate enrollment  1532 non-null  float64
 30  Percent of total enrollment that are Asian  1532 non-null  float64
 31  Percent of total enrollment that are Black or African American  1532 non-null  float64
 32  Percent of total enrollment that are Hispanic/Latino  1532 non-null  float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  1532 non-null  float64
 34  Percent of total enrollment that are White  1532 non-null  float64
 35  Percent of total enrollment that are two or more races  1532 non-null  float64
 36  Percent of total enrollment that are Nonresident Alien  1532 non-null  float64
 37  Percent of total enrollment that are women  1532 non-null  float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native  1522 non-null  float64
 39  Number of first time undergraduates in state  911 non-null  float64
 40  Number of first time undergraduates out of state  911 non-null  float64
 41  Number of first time undergraduates foreign countries  911 non-null  float64
 42  Number of first time undergraduates residence unknown  911 non-null  float64
 43  Graduation rate Bachelor degree within 4 years, total  1476 non-null  float64
 44  Graduation rate Bachelor degree within 5 years, total  1476 non-null  float64
 45  Graduation rate Bachelor degree within 6 years, total  1476 non-null  float64
 46  Percent of freshmen receiving any financial aid  1492 non-null  float64
 47  Percent of freshmen receiving federal grant aid  1492 non-null  float64
 48  Percent of freshmen receiving Pell grants  1492 non-null  float64
 49  Percent of freshmen receiving institutional grant aid  1492 non-null  float64
 50  Percent of freshmen receiving student loan aid  1492 non-null  float64
 51  Endowment assets  1534 non-null  float64
dtypes: float64(45), object(7)
memory usage: 635.2+ KB
data.dropna(subset=['Rank'], inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 375 entries, 0 to 393 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Rank 375 non-null float64 1 Name 375 non-null object 2 Tuition 375 non-null float64 3 Enrollment Numbers 375 non-null float64 4 Applicants total 367 non-null float64 5 Admissions total 367 non-null float64 6 Enrolled total 367 non-null float64 7 Percent of freshmen submitting SAT scores 354 non-null float64 8 Percent of freshmen submitting ACT scores 355 non-null float64 9 SAT Critical Reading 25th percentile score 337 non-null float64 10 SAT Critical Reading 75th percentile score 337 non-null float64 11 SAT Math 25th percentile score 342 non-null float64 12 SAT Math 75th percentile score 342 non-null float64 13 SAT Writing 25th percentile score 207 non-null float64 14 SAT Writing 75th percentile score 207 non-null float64 15 ACT Composite 25th percentile score 341 non-null float64 16 ACT Composite 75th percentile score 341 non-null float64 17 State abbreviation 375 non-null object 18 Geographic region 375 non-null object 19 Control of institution 375 non-null object 20 Historically Black College or University 375 non-null object 21 Degree of urbanization (Urban centric locale) 375 non-null object 22 Carnegie Classification 2010: Basic 375 non-null object 23 Total enrollment 375 non-null float64 24 Full time enrollment 375 non-null float64 25 Part time enrollment 375 non-null float64 26 Undergraduate enrollment 375 non-null float64 27 Graduate enrollment 375 non-null float64 28 Full time undergraduate enrollment 375 non-null float64 29 Part time undergraduate enrollment 375 non-null float64 30 Percent of total enrollment that are Asian 375 non-null float64 31 Percent of total enrollment that are Black or African American 375 non-null float64 32 Percent of total enrollment that are Hispanic/Latino 375 non-null float64 33 Percent of total enrollment that are Native Hawaiian or Other Pacific Islander 
375 non-null float64 34 Percent of total enrollment that are White 375 non-null float64 35 Percent of total enrollment that are two or more races 375 non-null float64 36 Percent of total enrollment that are Nonresident Alien 375 non-null float64 37 Percent of total enrollment that are women 375 non-null float64 38 Percent of undergraduate enrollment that are American Indian or Alaska Native 375 non-null float64 39 Number of first time undergraduates in state 272 non-null float64 40 Number of first time undergraduates out of state 272 non-null float64 41 Number of first time undergraduates foreign countries 272 non-null float64 42 Number of first time undergraduates residence unknown 272 non-null float64 43 Graduation rate Bachelor degree within 4 years, total 372 non-null float64 44 Graduation rate Bachelor degree within 5 years, total 372 non-null float64 45 Graduation rate Bachelor degree within 6 years, total 372 non-null float64 46 Percent of freshmen receiving any financial aid 373 non-null float64 47 Percent of freshmen receiving federal grant aid 373 non-null float64 48 Percent of freshmen receiving Pell grants 373 non-null float64 49 Percent of freshmen receiving institutional grant aid 373 non-null float64 50 Percent of freshmen receiving student loan aid 373 non-null float64 51 Endowment assets 375 non-null float64 dtypes: float64(45), object(7) memory usage: 155.3+ KB
#columns with at most 30 nulls
data.columns[data.isnull().sum()<=30]
Index(['Rank', 'Name', 'Tuition', 'Enrollment Numbers', 'Applicants total',
'Admissions total', 'Enrolled total',
'Percent of freshmen submitting SAT scores',
'Percent of freshmen submitting ACT scores', 'State abbreviation',
'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic', 'Total enrollment',
'Full time enrollment', 'Part time enrollment',
'Undergraduate enrollment', 'Graduate enrollment',
'Full time undergraduate enrollment',
'Part time undergraduate enrollment',
'Percent of total enrollment that are Asian',
'Percent of total enrollment that are Black or African American',
'Percent of total enrollment that are Hispanic/Latino',
'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
'Percent of total enrollment that are White',
'Percent of total enrollment that are two or more races',
'Percent of total enrollment that are Nonresident Alien',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are American Indian or Alaska Native',
'Graduation rate Bachelor degree within 4 years, total',
'Graduation rate Bachelor degree within 5 years, total',
'Graduation rate Bachelor degree within 6 years, total',
'Percent of freshmen receiving any financial aid',
'Percent of freshmen receiving federal grant aid',
'Percent of freshmen receiving Pell grants',
'Percent of freshmen receiving institutional grant aid',
'Percent of freshmen receiving student loan aid', 'Endowment assets'],
dtype='object')
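To see how a null-count threshold like the one above partitions the columns, it can help to tabulate the counts directly. The sketch below uses a small made-up frame; `toy`, its column names, and the threshold of 1 are illustrative assumptions standing in for `data` and the 30 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# Null count per column, smallest first
null_counts = toy.isnull().sum().sort_values()
print(null_counts)

# Keep only columns with at most `max_nulls` missing values
max_nulls = 1  # assumption for the toy data; the notebook uses 30
keep = null_counts[null_counts <= max_nulls].index.tolist()
print(keep)
```

Sorting the counts first makes it easier to judge whether the cutoff falls in a natural gap or splits columns with similar amounts of missing data.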
df = data[[ 'Name','Rank','Tuition', 'Enrollment Numbers', 'Applicants total',
'Admissions total',
'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic',
'Undergraduate enrollment', 'Graduate enrollment',
'Full time undergraduate enrollment',
'Part time undergraduate enrollment',
'Percent of total enrollment that are Asian',
'Percent of total enrollment that are Black or African American',
'Percent of total enrollment that are Hispanic/Latino',
'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
'Percent of total enrollment that are White',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are American Indian or Alaska Native',
'Graduation rate Bachelor degree within 4 years, total',
'Percent of freshmen receiving any financial aid',
'Endowment assets']]
df = df.dropna()
df.dtypes
Name object Rank float64 Tuition float64 Enrollment Numbers float64 Applicants total float64 Admissions total float64 Geographic region object Control of institution object Historically Black College or University object Degree of urbanization (Urban centric locale) object Carnegie Classification 2010: Basic object Undergraduate enrollment float64 Graduate enrollment float64 Full time undergraduate enrollment float64 Part time undergraduate enrollment float64 Percent of total enrollment that are Asian float64 Percent of total enrollment that are Black or African American float64 Percent of total enrollment that are Hispanic/Latino float64 Percent of total enrollment that are Native Hawaiian or Other Pacific Islander float64 Percent of total enrollment that are White float64 Percent of total enrollment that are women float64 Percent of undergraduate enrollment that are American Indian or Alaska Native float64 Graduation rate Bachelor degree within 4 years, total float64 Percent of freshmen receiving any financial aid float64 Endowment assets float64 dtype: object
df
| Name | Rank | Tuition | Enrollment Numbers | Applicants total | Admissions total | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Undergraduate enrollment | Graduate enrollment | Full time undergraduate enrollment | Part time undergraduate enrollment | Percent of total enrollment that are Asian | Percent of total enrollment that are Black or African American | Percent of total enrollment that are Hispanic/Latino | Percent of total enrollment that are Native Hawaiian or Other Pacific Islander | Percent of total enrollment that are White | Percent of total enrollment that are women | Percent of undergraduate enrollment that are American Indian or Alaska Native | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Princeton University | 0.000 | 56010.000 | 4773.000 | 26499.000 | 1963.000 | Mid East DE DC MD NJ NY PA | Private not for profit | No | Suburb: Large | Research Universities (very high research acti... | 5323.000 | 2691.000 | 5244.000 | 79.000 | 15.000 | 6.000 | 7.000 | 0.000 | 45.000 | 45.000 | 0.000 | 88.000 | 60.000 | 2320421.000 |
| 1 | Columbia University | 1.000 | 63530.000 | 6170.000 | 31851.000 | 2362.000 | Mid East DE DC MD NJ NY PA | Private not for profit | No | City: Large | Research Universities (very high research acti... | 7970.000 | 18987.000 | 7374.000 | 596.000 | 13.000 | 5.000 | 8.000 | 0.000 | 36.000 | 51.000 | 1.000 | 86.000 | 57.000 | 316753.000 |
| 2 | Harvard University | 2.000 | 55587.000 | 5222.000 | 35023.000 | 2047.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 10534.000 | 17763.000 | 7240.000 | 3294.000 | 13.000 | 5.000 | 7.000 | 0.000 | 45.000 | 49.000 | 0.000 | 87.000 | 75.000 | 1392761.000 |
| 3 | Massachusetts Institute of Technology | 3.000 | 55878.000 | 4361.000 | 18989.000 | 1548.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 4528.000 | 6773.000 | 4499.000 | 29.000 | 16.000 | 3.000 | 9.000 | 0.000 | 34.000 | 37.000 | 0.000 | 84.000 | 87.000 | 980404.000 |
| 4 | Yale University | 4.000 | 59950.000 | 4703.000 | 28977.000 | 2043.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 5430.000 | 6679.000 | 5424.000 | 6.000 | 13.000 | 5.000 | 7.000 | 0.000 | 48.000 | 49.000 | 1.000 | 90.000 | 61.000 | 1528324.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 386 | Western Kentucky University | 384.000 | 26496.000 | 15286.000 | 8526.000 | 7871.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | No | City: Small | Master's Colleges and Universities (larger pro... | 17509.000 | 2939.000 | 13382.000 | 4127.000 | 1.000 | 10.000 | 2.000 | 0.000 | 77.000 | 58.000 | 0.000 | 25.000 | 93.000 | 945.000 |
| 387 | Wichita State University | 385.000 | 18166.000 | 12406.000 | 3492.000 | 3344.000 | Plains IA KS MN MO NE ND SD | Public | No | City: Large | Research Universities (high research activity) | 11670.000 | 2716.000 | 8807.000 | 2863.000 | 6.000 | 6.000 | 8.000 | 0.000 | 63.000 | 52.000 | 1.000 | 22.000 | 89.000 | 17845.000 |
| 388 | William Carey University | 386.000 | 14100.000 | 3264.000 | 771.000 | 376.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | City: Small | Master's Colleges and Universities (larger pro... | 2257.000 | 1625.000 | 1886.000 | 371.000 | 3.000 | 27.000 | 2.000 | 0.000 | 64.000 | 66.000 | 1.000 | 46.000 | 92.000 | 2958.000 |
| 389 | William Woods University | 387.000 | 25930.000 | 873.000 | 897.000 | 674.000 | Plains IA KS MN MO NE ND SD | Private not for profit | No | Town: Distant | Master's Colleges and Universities (larger pro... | 1002.000 | 1134.000 | 843.000 | 159.000 | 1.000 | 3.000 | 1.000 | 0.000 | 84.000 | 68.000 | 0.000 | 45.000 | 100.000 | 11097.000 |
| 391 | Wingate University | 389.000 | 40170.000 | 2683.000 | 5323.000 | 4221.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | Suburb: Large | Master's Colleges and Universities (smaller pr... | 2009.000 | 993.000 | 1953.000 | 56.000 | 2.000 | 14.000 | 2.000 | 0.000 | 62.000 | 60.000 | 1.000 | 47.000 | 99.000 | 17933.000 |
365 rows × 25 columns
df['Enrollment Numbers'] = df['Undergraduate enrollment'] + df['Graduate enrollment']
df['Percent undergraduate'] = df['Undergraduate enrollment']/(df['Enrollment Numbers'])
df['Percent fulltime'] = df['Full time undergraduate enrollment']/(df['Undergraduate enrollment'])
df['Percent admitted'] = df['Admissions total']/(df['Applicants total'])
df['Geographic region'].unique()
array(['Mid East DE DC MD NJ NY PA', 'New England CT ME MA NH RI VT',
'Far West AK CA HI NV OR WA', 'Great Lakes IL IN MI OH WI',
'Southeast AL AR FL GA KY LA MS NC SC TN VA WV',
'Plains IA KS MN MO NE ND SD', 'Southwest AZ NM OK TX',
'Rocky Mountains CO ID MT UT WY'], dtype=object)
df['Control of institution'].unique()
array(['Private not for profit', 'Public'], dtype=object)
df['Historically Black College or University'].unique()
array(['No', 'Yes'], dtype=object)
df['Degree of urbanization (Urban centric locale)'].unique()
array(['Suburb: Large', 'City: Large', 'City: Midsize', 'City: Small',
'Town: Remote', 'Suburb: Midsize', 'Suburb: Small', 'Town: Fringe',
'Rural: Fringe', 'Town: Distant'], dtype=object)
df['Carnegie Classification 2010: Basic'].unique()
array(['Research Universities (very high research activity)',
'Research Universities (high research activity)',
'Doctoral/Research Universities',
"Master's Colleges and Universities (larger programs)",
"Master's Colleges and Universities (smaller programs)",
"Master's Colleges and Universities (medium programs)",
'Baccalaureate Colleges Diverse Fields',
'Baccalaureate Colleges Arts & Sciences'], dtype=object)
df.columns
Index(['Name', 'Rank', 'Tuition', 'Enrollment Numbers', 'Applicants total',
'Admissions total', 'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic', 'Undergraduate enrollment',
'Graduate enrollment', 'Full time undergraduate enrollment',
'Part time undergraduate enrollment',
'Percent of total enrollment that are Asian',
'Percent of total enrollment that are Black or African American',
'Percent of total enrollment that are Hispanic/Latino',
'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
'Percent of total enrollment that are White',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are American Indian or Alaska Native',
'Graduation rate Bachelor degree within 4 years, total',
'Percent of freshmen receiving any financial aid', 'Endowment assets',
'Percent undergraduate', 'Percent fulltime', 'Percent admitted'],
dtype='object')
df.describe()
| Rank | Tuition | Enrollment Numbers | Applicants total | Admissions total | Undergraduate enrollment | Graduate enrollment | Full time undergraduate enrollment | Part time undergraduate enrollment | Percent of total enrollment that are Asian | Percent of total enrollment that are Black or African American | Percent of total enrollment that are Hispanic/Latino | Percent of total enrollment that are Native Hawaiian or Other Pacific Islander | Percent of total enrollment that are White | Percent of total enrollment that are women | Percent of undergraduate enrollment that are American Indian or Alaska Native | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 |
| mean | 187.227 | 34593.723 | 16898.238 | 13867.986 | 7260.882 | 12326.797 | 4571.441 | 10476.556 | 1850.241 | 6.192 | 10.605 | 8.907 | 0.074 | 58.414 | 55.340 | 0.321 | 42.118 | 86.299 | 62104.348 | 0.708 | 0.852 | 0.622 |
| std | 110.507 | 13681.923 | 12481.095 | 12524.285 | 5869.563 | 9643.301 | 3992.240 | 8268.793 | 2311.730 | 6.417 | 13.882 | 9.987 | 0.609 | 18.201 | 9.366 | 0.744 | 21.973 | 12.326 | 199648.073 | 0.135 | 0.112 | 0.208 |
| min | 0.000 | -1.000 | 1225.000 | 84.000 | 83.000 | 973.000 | 214.000 | 611.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 23.000 | 0.000 | 0.000 | 44.000 | 0.000 | 0.201 | 0.373 | 0.057 |
| 25% | 93.000 | 24110.000 | 6747.000 | 4801.000 | 2764.000 | 4428.000 | 1887.000 | 3815.000 | 359.000 | 2.000 | 4.000 | 3.000 | 0.000 | 48.000 | 51.000 | 0.000 | 24.000 | 79.000 | 6274.000 | 0.632 | 0.785 | 0.507 |
| 50% | 185.000 | 32299.000 | 13868.000 | 10525.000 | 5529.000 | 9718.000 | 3335.000 | 8040.000 | 1091.000 | 4.000 | 6.000 | 6.000 | 0.000 | 62.000 | 55.000 | 0.000 | 38.000 | 90.000 | 14574.000 | 0.745 | 0.880 | 0.655 |
| 75% | 277.000 | 44382.000 | 24629.000 | 18989.000 | 10405.000 | 18615.000 | 5789.000 | 15879.000 | 2664.000 | 8.000 | 12.000 | 10.000 | 0.000 | 72.000 | 60.000 | 0.000 | 59.000 | 96.000 | 36852.000 | 0.813 | 0.937 | 0.774 |
| max | 389.000 | 63530.000 | 77338.000 | 72676.000 | 35815.000 | 51333.000 | 29874.000 | 40020.000 | 21553.000 | 39.000 | 91.000 | 79.000 | 10.000 | 94.000 | 95.000 | 6.000 | 90.000 | 100.000 | 2320421.000 | 0.918 | 1.000 | 1.000 |
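The `describe()` output above shows a minimum `Tuition` of -1, which cannot be a real price and is possibly a missing-value placeholder in the source file (an assumption, since the data dictionary is not shown here). A minimal sketch of how such rows could be flagged and treated as missing, using made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the -1 sentinel mirrors what describe() shows
toy = pd.DataFrame({
    'Name': ['College A', 'College B', 'College C'],
    'Tuition': [56010.0, -1.0, 32299.0],
})

# Flag rows whose tuition is not a plausible (positive) amount
suspect = toy[toy['Tuition'] <= 0]
print(suspect)

# One option (an assumption, not the notebook's approach): treat them as missing
cleaned = toy.assign(Tuition=toy['Tuition'].where(toy['Tuition'] > 0, np.nan))
print(cleaned['Tuition'].isna().sum())
```

Whether to drop, impute, or keep such rows depends on what the placeholder actually means in the original ranking file.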
df_small = df[['Name', 'Rank', 'Tuition', 'Enrollment Numbers', 'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic',
'Percent of total enrollment that are White',
'Percent of total enrollment that are women',
'Graduation rate Bachelor degree within 4 years, total',
'Percent of freshmen receiving any financial aid', 'Endowment assets',
'Percent undergraduate', 'Percent fulltime', 'Percent admitted']].copy()  # copy so later column assignments do not write to a slice of df
Clustering
sns.pairplot(df_small)
<seaborn.axisgrid.PairGrid at 0x1b6d90a7650>
sns.pairplot(df_small, hue = 'Control of institution')
<seaborn.axisgrid.PairGrid at 0x1b6ad740b50>
The private/public split is clearly visible in many of the plots, particularly those involving tuition, enrollment size, and percent undergraduate.
sns.pairplot(df_small, hue = 'Degree of urbanization (Urban centric locale)')
<seaborn.axisgrid.PairGrid at 0x1b6b639fb50>
Heavily blended, with no clearly distinct splits.
sns.pairplot(df_small, hue = 'Carnegie Classification 2010: Basic')
<seaborn.axisgrid.PairGrid at 0x1b706e85290>
Very high research universities tend to appear near each other on many graphs, as do high research universities and large master's programs. However, there is still a lot of blending and no distinct clustering.
sns.pairplot(df_small, hue = 'Geographic region')
<seaborn.axisgrid.PairGrid at 0x1b6999b5d50>
Very blended, with few to no distinct splits.
Correlations: the categorical columns are first converted to integer category codes so that they can be included in the correlation matrix.
df_small['Geographic region num'] = df_small['Geographic region'].astype('category').cat.codes
df_small['Control of institution num'] = df_small['Control of institution'].astype('category').cat.codes
df_small['Historically Black College or University num'] = df_small['Historically Black College or University'].astype('category').cat.codes
df_small['Degree of urbanization (Urban centric locale) num'] = df_small['Degree of urbanization (Urban centric locale)'].astype('category').cat.codes
df_small['Carnegie Classification 2010: Basic num'] = df_small['Carnegie Classification 2010: Basic'].astype('category').cat.codes
df_small
| Name | Rank | Tuition | Enrollment Numbers | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Percent of total enrollment that are White | Percent of total enrollment that are women | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | Geographic region num | Control of institution num | Historically Black College or University num | Degree of urbanization (Urban centric locale) num | Carnegie Classification 2010: Basic num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Princeton University | 0.000 | 56010.000 | 8014.000 | Mid East DE DC MD NJ NY PA | Private not for profit | No | Suburb: Large | Research Universities (very high research acti... | 45.000 | 45.000 | 88.000 | 60.000 | 2320421.000 | 0.664 | 0.985 | 0.074 | 2 | 0 | 0 | 4 | 7 |
| 1 | Columbia University | 1.000 | 63530.000 | 26957.000 | Mid East DE DC MD NJ NY PA | Private not for profit | No | City: Large | Research Universities (very high research acti... | 36.000 | 51.000 | 86.000 | 57.000 | 316753.000 | 0.296 | 0.925 | 0.074 | 2 | 0 | 0 | 0 | 7 |
| 2 | Harvard University | 2.000 | 55587.000 | 28297.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 45.000 | 49.000 | 87.000 | 75.000 | 1392761.000 | 0.372 | 0.687 | 0.058 | 3 | 0 | 0 | 1 | 7 |
| 3 | Massachusetts Institute of Technology | 3.000 | 55878.000 | 11301.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 34.000 | 37.000 | 84.000 | 87.000 | 980404.000 | 0.401 | 0.994 | 0.082 | 3 | 0 | 0 | 1 | 7 |
| 4 | Yale University | 4.000 | 59950.000 | 12109.000 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | 48.000 | 49.000 | 90.000 | 61.000 | 1528324.000 | 0.448 | 0.999 | 0.071 | 3 | 0 | 0 | 1 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 386 | Western Kentucky University | 384.000 | 26496.000 | 20448.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | No | City: Small | Master's Colleges and Universities (larger pro... | 77.000 | 58.000 | 25.000 | 93.000 | 945.000 | 0.856 | 0.764 | 0.923 | 6 | 1 | 0 | 2 | 3 |
| 387 | Wichita State University | 385.000 | 18166.000 | 14386.000 | Plains IA KS MN MO NE ND SD | Public | No | City: Large | Research Universities (high research activity) | 63.000 | 52.000 | 22.000 | 89.000 | 17845.000 | 0.811 | 0.755 | 0.958 | 4 | 1 | 0 | 0 | 6 |
| 388 | William Carey University | 386.000 | 14100.000 | 3882.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | City: Small | Master's Colleges and Universities (larger pro... | 64.000 | 66.000 | 46.000 | 92.000 | 2958.000 | 0.581 | 0.836 | 0.488 | 6 | 0 | 0 | 2 | 3 |
| 389 | William Woods University | 387.000 | 25930.000 | 2136.000 | Plains IA KS MN MO NE ND SD | Private not for profit | No | Town: Distant | Master's Colleges and Universities (larger pro... | 84.000 | 68.000 | 45.000 | 100.000 | 11097.000 | 0.469 | 0.841 | 0.751 | 4 | 0 | 0 | 7 | 3 |
| 391 | Wingate University | 389.000 | 40170.000 | 3002.000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | Suburb: Large | Master's Colleges and Universities (smaller pr... | 62.000 | 60.000 | 47.000 | 99.000 | 17933.000 | 0.669 | 0.972 | 0.793 | 6 | 0 | 0 | 4 | 5 |
365 rows × 22 columns
plt.figure(figsize=(15,8))
sns.heatmap(df_small.corr(numeric_only=True),annot = True)
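## or drop the non-numeric columns instead of setting numeric_only to True
## sns.heatmap(df_small.drop(columns=['Name', 'Geographic region', 'Control of institution', 'Historically Black College or University', 'Degree of urbanization (Urban centric locale)', 'Carnegie Classification 2010: Basic']).corr(), annot = True)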
<Axes: >
There is a lot of correlation between variables, including among rank, tuition, and the four-year graduation rate. Percent admitted and endowment assets are also strongly correlated with other variables, and the encoded categorical variables correlate strongly with several features. PCA should therefore be valuable before clustering.
PCA (Principal Component Analysis) is a dimension reduction technique that consolidates the information in a dataset's features into a new set of features (principal components) that are uncorrelated with each other and ordered by how much of the dataset's variance each one explains.
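As a small illustration of that definition, the sketch below fits PCA to synthetic data (not the college data) in which two of the three features are strongly related. The names `rng`, `toy`, and `components` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Columns 0 and 1 are nearly collinear; column 2 is independent noise
toy = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

# Share of total variance explained by each component, in decreasing order
print(pca.explained_variance_ratio_)

# The components are uncorrelated: off-diagonal entries are approximately 0
print(np.round(np.corrcoef(components, rowvar=False), 3))
```

Because the first two features carry almost the same information, most of the variance ends up in fewer components than there are original features, which is what makes a two-component projection useful for plotting.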
Make a dictionary for converting the categorical columns, so that we also know which code corresponds to which category when we look at them again later.
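One way to build such a lookup, sketched on a made-up column: `cat.codes` numbers the categories in the order of `cat.categories` (sorted for string labels), so enumerating the categories recovers the code-to-label mapping. The names `toy` and `code_to_label` are illustrative.

```python
import pandas as pd

# Hypothetical rows using the two labels seen in 'Control of institution'
toy = pd.DataFrame({
    'Control of institution': ['Public', 'Private not for profit', 'Public'],
})

as_cat = toy['Control of institution'].astype('category')
toy['Control of institution num'] = as_cat.cat.codes

# code -> label lookup, in the same order that cat.codes uses
code_to_label = dict(enumerate(as_cat.cat.categories))
print(code_to_label)
```

The same pattern could be repeated for each encoded column, for example by collecting the per-column dictionaries in an outer dictionary keyed by column name.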
#use all numeric columns (including the encoded categoricals) as X
X = df_small.drop(['Name', 'Rank', 'Geographic region', 'Control of institution',
'Historically Black College or University',
'Degree of urbanization (Urban centric locale)',
'Carnegie Classification 2010: Basic'], axis=1)
#Create PCA model
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(X)
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
<Axes: xlabel='0', ylabel='1'>
This looks odd, and there are definitely some outliers. How should we handle them? We will test outlier removal and data scaling.
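One likely contributor is scale: PCA picks the directions of greatest variance, so on unscaled data a column measured in large units (here `Endowment assets`, with a standard deviation near 200,000) can dominate the first component. The sketch below shows the effect on synthetic data; the feature scales and names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on very different scales
big = rng.normal(scale=200_000, size=300)
small = rng.normal(scale=0.2, size=300)
toy = np.column_stack([big, small])

# Unscaled: the first component is almost entirely the large-scale feature
raw_pca = PCA(n_components=2).fit(toy)
print(raw_pca.explained_variance_ratio_)
print(np.abs(raw_pca.components_[0]))

# Standardized: both features contribute on an equal footing
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))
print(scaled_pca.explained_variance_ratio_)
```

Standardizing does not remove genuine outliers, so it is still worth checking extreme rows separately after scaling.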
X.columns
Index(['Tuition', 'Enrollment Numbers',
'Percent of total enrollment that are White',
'Percent of total enrollment that are women',
'Graduation rate Bachelor degree within 4 years, total',
'Percent of freshmen receiving any financial aid', 'Endowment assets',
'Percent undergraduate', 'Percent fulltime', 'Percent admitted',
'Geographic region num', 'Control of institution num',
'Historically Black College or University num',
'Degree of urbanization (Urban centric locale) num',
'Carnegie Classification 2010: Basic num'],
dtype='object')
from sklearn.preprocessing import StandardScaler
#scale the float64/int64 columns of df_small in place (the int8 category codes are not selected)
numerical_cols = df_small.select_dtypes(include=['float64', 'int64']).columns
scaler = StandardScaler()
df_small[numerical_cols] = scaler.fit_transform(df_small[numerical_cols])
df_small.describe()
| Rank | Tuition | Enrollment Numbers | Percent of total enrollment that are White | Percent of total enrollment that are women | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | Geographic region num | Control of institution num | Historically Black College or University num | Degree of urbanization (Urban centric locale) num | Carnegie Classification 2010: Basic num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 |
| mean | 0.000 | 0.000 | 0.000 | -0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -0.000 | 0.000 | 3.603 | 0.540 | 0.027 | 2.282 | 4.764 |
| std | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 2.365 | 0.499 | 0.163 | 2.510 | 2.007 |
| min | -1.697 | -2.532 | -1.257 | -3.214 | -3.458 | -1.919 | -3.436 | -0.311 | -3.773 | -4.263 | -2.718 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | -0.854 | -0.767 | -0.814 | -0.573 | -0.464 | -0.826 | -0.593 | -0.280 | -0.565 | -0.595 | -0.553 | 2.000 | 0.000 | 0.000 | 0.000 | 3.000 |
| 50% | -0.020 | -0.168 | -0.243 | 0.197 | -0.036 | -0.188 | 0.301 | -0.238 | 0.276 | 0.255 | 0.158 | 3.000 | 1.000 | 0.000 | 1.000 | 6.000 |
| 75% | 0.813 | 0.716 | 0.620 | 0.748 | 0.498 | 0.769 | 0.788 | -0.127 | 0.784 | 0.758 | 0.733 | 6.000 | 1.000 | 0.000 | 4.000 | 7.000 |
| max | 1.828 | 2.118 | 4.849 | 1.958 | 4.241 | 2.182 | 1.113 | 11.327 | 1.565 | 1.320 | 1.819 | 7.000 | 1.000 | 1.000 | 9.000 | 7.000 |
#scale the data without changing the column names
scaler = preprocessing.StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X = pd.DataFrame(scaled_X, columns=X.columns)
scaled_X.describe()
| Tuition | Enrollment Numbers | Percent of total enrollment that are White | Percent of total enrollment that are women | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | Geographic region num | Control of institution num | Historically Black College or University num | Degree of urbanization (Urban centric locale) num | Carnegie Classification 2010: Basic num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 | 365.000 |
| mean | 0.000 | 0.000 | 0.000 | -0.000 | 0.000 | -0.000 | 0.000 | -0.000 | 0.000 | 0.000 | -0.000 | 0.000 | 0.000 | -0.000 | -0.000 |
| std | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 | 1.001 |
| min | -2.532 | -1.257 | -3.214 | -3.458 | -1.919 | -3.436 | -0.311 | -3.773 | -4.263 | -2.718 | -1.526 | -1.083 | -0.168 | -0.910 | -2.378 |
| 25% | -0.767 | -0.814 | -0.573 | -0.464 | -0.826 | -0.593 | -0.280 | -0.565 | -0.595 | -0.553 | -0.679 | -1.083 | -0.168 | -0.910 | -0.880 |
| 50% | -0.168 | -0.243 | 0.197 | -0.036 | -0.188 | 0.301 | -0.238 | 0.276 | 0.255 | 0.158 | -0.255 | 0.923 | -0.168 | -0.511 | 0.617 |
| 75% | 0.716 | 0.620 | 0.748 | 0.498 | 0.769 | 0.788 | -0.127 | 0.784 | 0.758 | 0.733 | 1.015 | 0.923 | -0.168 | 0.685 | 1.116 |
| max | 2.118 | 4.849 | 1.958 | 4.241 | 2.182 | 1.113 | 11.327 | 1.565 | 1.320 | 1.819 | 1.439 | 0.923 | 5.958 | 2.680 | 1.116 |
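The `std` row above reads 1.001 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below checks this on synthetic data; `demo_X` and its two columns are made-up stand-ins, not the notebook's `X`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X: 365 rows, two made-up columns
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "a": rng.normal(50_000, 10_000, size=365),
    "b": rng.integers(0, 8, size=365).astype(float),
})

demo_scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo_X), columns=demo_X.columns
)

# StandardScaler standardizes with the population std (ddof=0) ...
pop_std = demo_scaled.std(ddof=0)
# ... while pandas std()/describe() default to the sample std (ddof=1)
sample_std = demo_scaled.std()

n = len(demo_X)
expected = np.sqrt(n / (n - 1))  # ratio between the two conventions
print(pop_std.round(3).tolist(), sample_std.round(3).tolist(), round(expected, 3))
```

With 365 rows the ratio is sqrt(365/364), which rounds to 1.001, so the scaled columns are standardized as intended.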
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(df_small[['Percent of total enrollment that are White',
'Percent of total enrollment that are women',
'Graduation rate Bachelor degree within 4 years, total',
'Percent of freshmen receiving any financial aid', 'Endowment assets',
'Percent undergraduate', 'Percent fulltime', 'Percent admitted',
'Geographic region num', 'Control of institution num',
'Historically Black College or University num',
'Degree of urbanization (Urban centric locale) num',
'Carnegie Classification 2010: Basic num']])
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
<Axes: xlabel='0', ylabel='1'>
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(scaled_X)
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
Much much better! Let's explore these for clustering.
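One reason scaling can change a PCA plot this much is that PCA follows variance, so a column measured in very large units can dominate the first component. The following is a minimal sketch on synthetic data (the two features are invented for illustration and are not the notebook's columns).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
# Two independent synthetic features on very different scales
big = rng.normal(0, 1e6, size=n)    # e.g. a dollar amount
small = rng.normal(0, 1.0, size=n)  # e.g. a standardized score
raw = np.column_stack([big, small])

# Share of variance captured by each component, before and after scaling
raw_ratio = PCA(n_components=2).fit(raw).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(raw)
).explained_variance_ratio_

print("unscaled:", raw_ratio.round(4))
print("scaled:  ", scaled_ratio.round(4))
```

On the unscaled data almost all of the variance lands on the first component, which simply mirrors the large-unit column. After scaling, the two components share the variance much more evenly.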
# from sklearn.decomposition import PCA
# from scipy.spatial import distance
# import numpy as np
# # Create PCA model
# pca = PCA(n_components=2)
# pca_mdl = pca.fit_transform(scaled_X)
# # Calculate the distance from each point to the origin
# distances = np.sqrt((pca_mdl**2).sum(axis=1))
# # Calculate the mean and standard deviation of the distances
# mean_distance = np.mean(distances)
# std_distance = np.std(distances)
# # Define outliers to be any point that is more than 3 standard deviations from the mean
# outliers = pca_mdl[distances > mean_distance + 3*std_distance]
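#use outliers to find the index of the outliers
#NOTE: `outliers` is only defined by the commented-out block above; uncomment that block before running this cell
outliers_index = []
for i in range(len(pca_mdl)):
    #compare whole rows; `row in array` on a 2-D NumPy array matches on any single equal element
    if any(np.array_equal(pca_mdl[i], o) for o in outliers):
        outliers_index.append(i)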
#create a new df without the outliers
new_df = scaled_X.drop(outliers_index).copy()
#Create PCA model
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(new_df)
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
<Axes: xlabel='0', ylabel='1'>
#print outliers_index from scaled_X
scaled_X.iloc[outliers_index]
| Tuition | Enrollment Numbers | Percent of total enrollment that are White | Percent of total enrollment that are women | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | Geographic region num | Control of institution num | Historically Black College or University num | Degree of urbanization (Urban centric locale) num | Carnegie Classification 2010: Basic num |
|---|
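The same "more than 3 standard deviations from the mean distance" rule can also be written with a boolean mask, which avoids comparing rows one by one. This is a sketch on synthetic scores with one planted far-away point; `scores`, `frame`, and the column names are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic 2-D "PCA scores": 200 ordinary points plus one planted far-away point
scores = rng.normal(0, 1, size=(200, 2))
scores = np.vstack([scores, [[25.0, 25.0]]])
frame = pd.DataFrame(scores, columns=["pc0", "pc1"])

# Distance of every point from the origin of the PCA plane
distances = np.sqrt((scores ** 2).sum(axis=1))

# Flag points more than 3 standard deviations above the mean distance
threshold = distances.mean() + 3 * distances.std()
mask = distances > threshold

outlier_positions = np.flatnonzero(mask)  # positional indices of the outliers
cleaned = frame.loc[~mask]                # rows that are kept

print(outlier_positions, len(cleaned))
```

An empty result, like the empty table above, would simply mean that no point exceeded the threshold.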
We will begin our modeling with K-Means Clustering.
Briefly explain how the K-Means clustering model works.
K-means is a centroid-based clustering algorithm that partitions the data into k clusters. It starts from k initial centroids, assigns each data point to the nearest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats these two steps until the assignments stop changing.
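To make the assign-then-update loop concrete, here is a minimal NumPy sketch. `assign` and `lloyd_kmeans` are hypothetical helpers written for illustration (with a few random restarts, loosely analogous to scikit-learn's `n_init`); this is not how scikit-learn implements `KMeans` internally.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=50, n_restarts=5, seed=0):
    """Hypothetical helper for illustration; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    best = None
    for _restart in range(n_restarts):
        # Start from k distinct data points chosen at random
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _step in range(n_iter):
            labels = assign(points, centroids)
            # Update step: each centroid moves to the mean of its points
            # (an empty cluster keeps its previous centroid)
            updated = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        labels = assign(points, centroids)
        inertia = ((points - centroids[labels]) ** 2).sum()
        # Keep the restart with the smallest within-cluster sum of squares
        if best is None or inertia < best[0]:
            best = (inertia, labels, centroids)
    return best[1], best[2]

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
toy = np.vstack([blob_a, blob_b])

toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids.round(2))
```

On data this cleanly separated, each blob should end up with its own label and the centroids should sit near the blob centers.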
Remember how we determine the best number of clusters (if we can't just manually look at it and decide)?
We look at the variance -- or, the sum of squared distances between the observations and their centroids. Note: "inertia" is the "within-cluster sum-of-squares criterion." See scikit learn documentation.
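As a quick check of what `inertia_` measures, the sketch below recomputes it by hand as the sum of squared distances between each point and its assigned centroid. The data are synthetic and `n_init=10` is set explicitly as an assumption; note also that the loop in this notebook stores the square root of the inertia rather than the inertia itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Two synthetic blobs of 60 points each
demo = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(60, 2)),
    rng.normal(loc=[8, 8], scale=1.0, size=(60, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(demo)

# Centroid assigned to each point, then the sum of squared distances to it
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((demo - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```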
inertia = []
for k in range(1,8):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(scaled_X)
    inertia.append(np.sqrt(kmeans.inertia_))
c:\Users\jesse\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning super()._check_params_vs_input(X, default_n_init=10)
c:\Users\jesse\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2. warnings.warn(
In the plot below, we look for the "elbow": the value of k up to which the variance drops sharply and after which it decreases at a slower rate. That value is our preferred number of clusters.
plt.plot(range(1, 8), inertia, marker='s');
plt.xlabel('$k$')
plt.ylabel('Variance')
Text(0, 0.5, 'Variance')
In this case, what is the optimal number of clusters and why?
As explained above, the variance decreases fairly consistently until k = 3 and more slowly afterwards, so we are going to start with 3 clusters and look at those.
#create KMeans model
kmeans = KMeans(n_clusters=3, random_state=1).fit(scaled_X)
Now that we have fit our k-means clusters, let's find which cluster label (0, 1, or 2, since we have set K=3) each row of data is assigned, so we can visualize it.
y = kmeans.fit_predict(scaled_X)
We are reusing the PCA (dimensionality reduction) data frame so that we can visualize the data in 2 dimensions (rather than the 15 features used for clustering).
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y)
<Axes: xlabel='0', ylabel='1'>
We could also try plotting individual features to take a look.
sns.scatterplot(x = df_small['Rank'], y = df_small['Tuition'], hue=y)
<Axes: xlabel='Rank', ylabel='Tuition'>
Let's add our clusters back to the original DataFrame so we can take a look at some of the items.
#work on an explicit copy so the assignment does not trigger a SettingWithCopyWarning
df_small = df_small.copy()
df_small['Cluster'] = y
df_small
| | Name | Rank | Tuition | Enrollment Numbers | Geographic region | Control of institution | Historically Black College or University | Degree of urbanization (Urban centric locale) | Carnegie Classification 2010: Basic | Percent of total enrollment that are White | Percent of total enrollment that are women | Graduation rate Bachelor degree within 4 years, total | Percent of freshmen receiving any financial aid | Endowment assets | Percent undergraduate | Percent fulltime | Percent admitted | Geographic region num | Control of institution num | Historically Black College or University num | Degree of urbanization (Urban centric locale) num | Carnegie Classification 2010: Basic num | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Princeton University | -1.697 | 1.567 | -0.713 | Mid East DE DC MD NJ NY PA | Private not for profit | No | Suburb: Large | Research Universities (very high research acti... | -0.738 | -1.106 | 2.091 | -2.137 | 11.327 | -0.326 | 1.188 | -2.636 | 2 | 0 | 0 | 4 | 7 | 2 |
| 1 | Columbia University | -1.688 | 2.118 | 0.807 | Mid East DE DC MD NJ NY PA | Private not for profit | No | City: Large | Research Universities (very high research acti... | -1.233 | -0.464 | 2.000 | -2.380 | 1.277 | -3.068 | 0.654 | -2.635 | 2 | 0 | 0 | 0 | 7 | 2 |
| 2 | Harvard University | -1.678 | 1.536 | 0.915 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | -0.738 | -0.678 | 2.045 | -0.918 | 6.674 | -2.498 | -1.464 | -2.711 | 3 | 0 | 0 | 1 | 7 | 2 |
| 3 | Massachusetts Institute of Technology | -1.669 | 1.558 | -0.449 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | -1.343 | -1.961 | 1.909 | 0.057 | 4.606 | -2.287 | 1.263 | -2.600 | 3 | 0 | 0 | 1 | 7 | 2 |
| 4 | Yale University | -1.660 | 1.856 | -0.384 | New England CT ME MA NH RI VT | Private not for profit | No | City: Midsize | Research Universities (very high research acti... | -0.573 | -0.678 | 2.182 | -2.055 | 7.354 | -1.931 | 1.310 | -2.653 | 3 | 0 | 0 | 1 | 7 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 386 | Western Kentucky University | 1.783 | -0.593 | 0.285 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Public | No | City: Small | Master's Colleges and Universities (larger pro... | 1.023 | 0.284 | -0.780 | 0.544 | -0.307 | 1.103 | -0.779 | 1.450 | 6 | 1 | 0 | 2 | 3 | 0 |
| 387 | Wichita State University | 1.792 | -1.202 | -0.202 | Plains IA KS MN MO NE ND SD | Public | No | City: Large | Research Universities (high research activity) | 0.252 | -0.357 | -0.917 | 0.219 | -0.222 | 0.767 | -0.864 | 1.615 | 4 | 1 | 0 | 0 | 6 | 0 |
| 388 | William Carey University | 1.801 | -1.500 | -1.044 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | City: Small | Master's Colleges and Universities (larger pro... | 0.307 | 1.140 | 0.177 | 0.463 | -0.297 | -0.942 | -0.143 | -0.646 | 6 | 0 | 0 | 2 | 3 | 1 |
| 389 | William Woods University | 1.810 | -0.634 | -1.184 | Plains IA KS MN MO NE ND SD | Private not for profit | No | Town: Distant | Master's Colleges and Universities (larger pro... | 1.408 | 1.354 | 0.131 | 1.113 | -0.256 | -1.777 | -0.093 | 0.623 | 4 | 0 | 0 | 7 | 3 | 1 |
| 391 | Wingate University | 1.828 | 0.408 | -1.115 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Private not for profit | No | Suburb: Large | Master's Colleges and Universities (smaller pr... | 0.197 | 0.498 | 0.222 | 1.032 | -0.222 | -0.289 | 1.072 | 0.823 | 6 | 0 | 0 | 4 | 5 | 1 |
365 rows × 23 columns
sns.pairplot(df_small, hue = 'Cluster')
<seaborn.axisgrid.PairGrid at 0x1b6918e9290>
Making an interactive scatterplot (so it is easier to hover over individual data points.)
Also note that the x- and y-axis are our PCA values (from dimensionality reduction).
Below, we concat the dataframe along with the PCA values so that we can visualize properly. hover_data allows us to specify which columns we want to look at when hovering over each point.
Let's try agglomerative clustering with the same dataset as what we did above to see how it differs. But first, can you give a brief description of Agglomerative Clustering?
Agglomerative clustering is hierarchical rather than centroid-based: it builds a tree of clusters from the bottom up. Each point begins as its own cluster, then the closest pair of clusters is merged. This repeats until a stopping criterion is met, such as reaching the requested number of clusters.
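The bottom-up merging can be traced on a tiny example with SciPy's `linkage`. The five 1-D points below are made up, and single linkage is used here only because its merge distances are easy to read off by hand (the notebook itself uses Ward linkage).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up 1-D points: two tight pairs and one distant point
toy = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])

# Each row of the result records one merge:
# [cluster_i, cluster_j, merge distance, size of the new cluster]
merges = linkage(toy, method="single")
first_merge = merges[0]

for row in merges:
    print(int(row[0]), int(row[1]), round(row[2], 2), int(row[3]))
```

The two closest points (indices 0 and 1) merge first, and the final row joins everything into a single cluster of all five points.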
AgglomerativeClustering?
Init signature: AgglomerativeClustering( n_clusters=2, *, affinity='deprecated', metric=None, memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False, ) Docstring: Agglomerative Clustering. Recursively merges pair of clusters of sample data; uses linkage distance. Read more in the :ref:`User Guide <hierarchical_clustering>`. Parameters ---------- n_clusters : int or None, default=2 The number of clusters to find. It must be ``None`` if ``distance_threshold`` is not ``None``. affinity : str or callable, default='euclidean' The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by :func:`sklearn.metrics.pairwise_distances` for its metric parameter. If linkage is "ward", only "euclidean" is accepted. If "precomputed", a distance matrix (instead of a similarity matrix) is needed as input for the fit method. .. deprecated:: 1.2 `affinity` was deprecated in version 1.2 and will be renamed to `metric` in 1.4. metric : str or callable, default=None Metric used to compute the linkage. Can be "euclidean", "l1", "l2", "manhattan", "cosine", or "precomputed". If set to `None` then "euclidean" is used. If linkage is "ward", only "euclidean" is accepted. If "precomputed", a distance matrix is needed as input for the fit method. .. versionadded:: 1.2 memory : str or object with the joblib.Memory interface, default=None Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory. connectivity : array-like or callable, default=None Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from `kneighbors_graph`. 
Default is ``None``, i.e, the hierarchical clustering algorithm is unstructured. compute_full_tree : 'auto' or bool, default='auto' Stop early the construction of the tree at ``n_clusters``. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be ``True`` if ``distance_threshold`` is not ``None``. By default `compute_full_tree` is "auto", which is equivalent to `True` when `distance_threshold` is not `None` or that `n_clusters` is inferior to the maximum between 100 or `0.02 * n_samples`. Otherwise, "auto" is equivalent to `False`. linkage : {'ward', 'complete', 'average', 'single'}, default='ward' Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. - 'ward' minimizes the variance of the clusters being merged. - 'average' uses the average of the distances of each observation of the two sets. - 'complete' or 'maximum' linkage uses the maximum distances between all observations of the two sets. - 'single' uses the minimum of the distances between all observations of the two sets. .. versionadded:: 0.20 Added the 'single' option distance_threshold : float, default=None The linkage distance threshold at or above which clusters will not be merged. If not ``None``, ``n_clusters`` must be ``None`` and ``compute_full_tree`` must be ``True``. .. versionadded:: 0.21 compute_distances : bool, default=False Computes distances between clusters even if `distance_threshold` is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead. .. 
versionadded:: 0.24 Attributes ---------- n_clusters_ : int The number of clusters found by the algorithm. If ``distance_threshold=None``, it will be equal to the given ``n_clusters``. labels_ : ndarray of shape (n_samples) Cluster labels for each point. n_leaves_ : int Number of leaves in the hierarchical tree. n_connected_components_ : int The estimated number of connected components in the graph. .. versionadded:: 0.21 ``n_connected_components_`` was added to replace ``n_components_``. n_features_in_ : int Number of features seen during :term:`fit`. .. versionadded:: 0.24 feature_names_in_ : ndarray of shape (`n_features_in_`,) Names of features seen during :term:`fit`. Defined only when `X` has feature names that are all strings. .. versionadded:: 1.0 children_ : array-like of shape (n_samples-1, 2) The children of each non-leaf node. Values less than `n_samples` correspond to leaves of the tree which are the original samples. A node `i` greater than or equal to `n_samples` is a non-leaf node and has children `children_[i - n_samples]`. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node `n_samples + i`. distances_ : array-like of shape (n_nodes-1,) Distances between nodes in the corresponding place in `children_`. Only computed if `distance_threshold` is used or `compute_distances` is set to `True`. See Also -------- FeatureAgglomeration : Agglomerative clustering but for features instead of samples. ward_tree : Hierarchical clustering with ward linkage. Examples -------- >>> from sklearn.cluster import AgglomerativeClustering >>> import numpy as np >>> X = np.array([[1, 2], [1, 4], [1, 0], ... [4, 2], [4, 4], [4, 0]]) >>> clustering = AgglomerativeClustering().fit(X) >>> clustering AgglomerativeClustering() >>> clustering.labels_ array([1, 1, 1, 0, 0, 0]) File: c:\users\jesse\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py Type: type Subclasses: FeatureAgglomeration
We have already done some pre-processing, but to keep things together for this practice, let's put it here again! We will be using the same scaled "X" that we used for K-Means.
X = scaled_X
Let's figure out how many clusters are optimal for this model. Agglomerative clustering uses a dendrogram to determine this number!
#Create and display a dendrogram
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title('Dendrogram')
plt.xlabel('Colleges')
plt.ylabel('Euclidean distances')
plt.axhline(y=825, color='r', linestyle='--')
plt.axhline(y=1575, color='r', linestyle='--')
dend = shc.dendrogram(shc.linkage(X, method='ward'))
To read a dendrogram to find the optimal number of clusters, find the tallest vertical stretch that no horizontal merge line crosses. The number of vertical lines that a horizontal cut through that stretch intersects (in this example, the blue lines) is the optimal number of clusters. Can you tell how many clusters is the optimal amount?
The largest such gap on this graph is just below the final merge at the very top, so the optimal number of clusters is 2.
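The "largest gap" reading of a dendrogram can also be approximated numerically from the linkage matrix. The sketch below is a heuristic on synthetic data with three planted blobs, not the notebook's data, and the name `suggested_k` is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Three synthetic, well-separated blobs of 30 points each
demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[10, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[5, 9], scale=0.5, size=(30, 2)),
])

Z = linkage(demo, method="ward")
heights = Z[:, 2]          # merge distances, in increasing order
gaps = np.diff(heights)    # vertical distance between consecutive merges
biggest = gaps.argmax()    # largest gap lies between merge `biggest` and `biggest + 1`

# Cutting inside that gap keeps the first `biggest + 1` merges,
# which leaves n_samples - (biggest + 1) clusters
suggested_k = len(demo) - (biggest + 1)
labels = fcluster(Z, t=suggested_k, criterion="maxclust")

print(suggested_k)
```

For the three planted blobs the largest gap sits just below the last two merges, so the heuristic suggests 3 clusters. On real data the gaps are usually less clear-cut, so the dendrogram itself is still worth inspecting.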
After determining what the optimal number of clusters is, input it into the model implementation below!
#Implement model
agglo = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
Now let's fit the model and create some predictions to visualize the clusters!
y_agglo = agglo.fit_predict(X)
Now let's visualize! We will once again be using PCA to do so.
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y_agglo)
<Axes: xlabel='0', ylabel='1'>
Now let's look at the K-Means visual again to compare.
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y)
<Axes: xlabel='0', ylabel='1'>
Can you note any differences or similarities you may see?
The split in k-means is much cleaner than in agglomerative. There is barely any overlap in k-means, while agglomerative classifies several things as overlapping.
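Beyond comparing the two plots by eye, the agreement between two labelings can be summarized with the adjusted Rand index, which ignores how the cluster labels happen to be numbered. This is a sketch on synthetic data; `demo` is made up, and the notebook's own `y` and `y_agglo` could be compared the same way, provided both use the same number of rows.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(6)
# Two synthetic, well-separated blobs of 80 points each
demo = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(80, 2)),
    rng.normal(loc=[6, 6], scale=1.0, size=(80, 2)),
])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(demo)
ag_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(demo)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(km_labels, ag_labels)
print(round(ari, 3))
```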
Let's also look again at some separate features. We will again be looking at attack and defense, just as we did with K-means!
sns.scatterplot(x = df['Attack'], y = df['Defense'], hue=y_agglo)
<Axes: xlabel='Attack', ylabel='Defense'>
Once again, we pull up the K-means visual for quick comparison. Can you note any similarities or differences once again?
Agglomerative resulted in a slightly tighter clustering of cluster 1 objects, with a few points with high defense scores being classified as 0 instead of 1. However, it also identified several more points in the <100 range as class 0. Overall, the shift is relatively minor to the naked eye and just makes drawing a split a little bit more difficult.
sns.scatterplot(x = df['Attack'], y = df['Defense'], hue=y)
<Axes: xlabel='Attack', ylabel='Defense'>
Let's make an interactive scatterplot again! Remember that the x- and y-axes are our PCA values (from dimensionality reduction). Below, we concatenate the dataframe with the PCA values so that we can visualize them properly. hover_data allows us to specify which columns we want to look at when hovering over each point.
#plotly express provides the interactive scatterplot
import plotly.express as px
y_a_df = pd.DataFrame(y_agglo, columns=['Cluster (Agglomerative)'])
new_a_df = pd.concat([df, y_a_df], axis=1)
fig = px.scatter(pd.concat([new_a_df, pca_df], axis = 1),
                 x = 0, y = 1, color='Cluster (Agglomerative)', hover_data=['Name','Type 1','Type 2','Legendary','Total'])
fig.show()